Homework 1

Author

Mark Tello-Rincon

Primary Question: Have daily concentrations of PM2.5 decreased in California over the 20 years spanning from 2002 to 2022?


Load data

data_2002 <- read.csv("~/Library/CloudStorage/GoogleDrive-tellorin@usc.edu/.shortcut-targets-by-id/10yI1Vp2x44iBX7T-_NfWNeL7go8kwnUH/2. College - USC/1. Degree/1. Courses/Y4 Senior/Fall 2025/PM 566/Assignments/Hw 1/data/2002.csv")

data_2022 <- read.csv("~/Library/CloudStorage/GoogleDrive-tellorin@usc.edu/.shortcut-targets-by-id/10yI1Vp2x44iBX7T-_NfWNeL7go8kwnUH/2. College - USC/1. Degree/1. Courses/Y4 Senior/Fall 2025/PM 566/Assignments/Hw 1/data/2022.csv")

library(ggplot2)
library(dplyr)

library(leaflet)
library(ggridges)
library(tidyr)

Summary of Initial Findings

Dimensions

  • 2002: 15976 rows & 22 columns
  • 2022: 59918 rows & 22 columns

Variables

  • Both datasets share the same 22 variables. PM2.5 is recorded in Daily.Mean.PM2.5.Concentration, which is a numeric variable in both years.
  • Other variables include Date, Source, Site.ID, and County.

Headers & Footers

  • The first and last rows of both datasets look as expected, which suggests the data loaded properly and without errors.

Distribution

  • 2002:
    • Min: 0, Max: 104.30, Median: 12.00, Mean: 16.12
    • The histogram appears to be positively skewed, with the majority of daily mean PM2.5 values falling between 0 and 20.
  • 2022:
    • Min: -6.7, Max: 302.5, Median: 6.8, Mean: 8.414
      • Given the negative minimum value and the high maximum value, there may be some errors in the data that are not visible in the headers or footers.
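    • Most values are concentrated in a narrow low range (the middle half lies between 4.1 and 10.7), but a small number of extreme values stretch the upper tail to 302.5. Since the mean (8.414) exceeds the median (6.8), the distribution is better described as positively skewed than as normal.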
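
The shape descriptions above rest largely on reading histograms by eye. As a rough numerical cross-check, one could compare the mean with the median and compute a moment-based sample skewness. The sketch below runs on simulated values, not the EPA data, and `sample_skewness` is a hypothetical helper rather than a function from any of the loaded packages.

```r
# Hypothetical helper: moment-based sample skewness (positive = right-skewed).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Simulated stand-in for daily PM2.5 values (NOT the real data).
set.seed(566)
pm_sim <- rlnorm(1000, meanlog = 2, sdlog = 0.6)

skew_sim <- sample_skewness(pm_sim)
mean_above_median <- mean(pm_sim) > median(pm_sim)

skew_sim            # positive for a right-skewed sample
mean_above_median   # TRUE when the upper tail pulls the mean above the median
```

On the real data the call would be `sample_skewness(data_2002$Daily.Mean.PM2.5.Concentration)`, and likewise for 2022. The reported summaries already show the mean above the median in both years (16.12 vs 12.00 and 8.414 vs 6.8), which is what a longer upper tail would produce.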
# Dimensions
dim(data_2002)
[1] 15976    22
dim(data_2022)
[1] 59918    22
# Headers and footers
head(data_2002)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/05/2002    AQS 60010007   1                           25.1 ug/m3 LC
2 01/06/2002    AQS 60010007   1                           31.6 ug/m3 LC
3 01/08/2002    AQS 60010007   1                           21.4 ug/m3 LC
4 01/11/2002    AQS 60010007   1                           25.9 ug/m3 LC
5 01/14/2002    AQS 60010007   1                           34.5 ug/m3 LC
6 01/17/2002    AQS 60010007   1                           41.0 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              81       Livermore               1              100
2              93       Livermore               1              100
3              74       Livermore               1              100
4              82       Livermore               1              100
5              98       Livermore               1              100
6             115       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         120
2              88101  PM2.5 - Local Conditions         120
3              88101  PM2.5 - Local Conditions         120
4              88101  PM2.5 - Local Conditions         120
5              88101  PM2.5 - Local Conditions         120
6              88101  PM2.5 - Local Conditions         120
                     Method.Description CBSA.Code
1 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
2 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
3 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
4 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
5 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
6 Andersen RAAS2.5-300 PM2.5 SEQ w/WINS     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(data_2002)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
15971 12/10/2002    AQS 61131003   1                             15 ug/m3 LC
15972 12/13/2002    AQS 61131003   1                             15 ug/m3 LC
15973 12/22/2002    AQS 61131003   1                              1 ug/m3 LC
15974 12/25/2002    AQS 61131003   1                             23 ug/m3 LC
15975 12/28/2002    AQS 61131003   1                              5 ug/m3 LC
15976 12/31/2002    AQS 61131003   1                              6 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
15971              62 Woodland-Gibson Road               1              100
15972              62 Woodland-Gibson Road               1              100
15973               6 Woodland-Gibson Road               1              100
15974              77 Woodland-Gibson Road               1              100
15975              28 Woodland-Gibson Road               1              100
15976              33 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
15971              88101  PM2.5 - Local Conditions         117
15972              88101  PM2.5 - Local Conditions         117
15973              88101  PM2.5 - Local Conditions         117
15974              88101  PM2.5 - Local Conditions         117
15975              88101  PM2.5 - Local Conditions         117
15976              88101  PM2.5 - Local Conditions         117
                         Method.Description CBSA.Code
15971 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15972 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15973 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15974 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15975 R & P Model 2000 PM2.5 Sampler w/WINS     40900
15976 R & P Model 2000 PM2.5 Sampler w/WINS     40900
                                    CBSA.Name State.FIPS.Code      State
15971 Sacramento--Roseville--Arden-Arcade, CA               6 California
15972 Sacramento--Roseville--Arden-Arcade, CA               6 California
15973 Sacramento--Roseville--Arden-Arcade, CA               6 California
15974 Sacramento--Roseville--Arden-Arcade, CA               6 California
15975 Sacramento--Roseville--Arden-Arcade, CA               6 California
15976 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
15971              113   Yolo      38.66121      -121.7327
15972              113   Yolo      38.66121      -121.7327
15973              113   Yolo      38.66121      -121.7327
15974              113   Yolo      38.66121      -121.7327
15975              113   Yolo      38.66121      -121.7327
15976              113   Yolo      38.66121      -121.7327
head(data_2022)
        Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
1 01/01/2022    AQS 60010007   3                           12.7 ug/m3 LC
2 01/02/2022    AQS 60010007   3                           13.9 ug/m3 LC
3 01/03/2022    AQS 60010007   3                            7.1 ug/m3 LC
4 01/04/2022    AQS 60010007   3                            3.7 ug/m3 LC
5 01/05/2022    AQS 60010007   3                            4.2 ug/m3 LC
6 01/06/2022    AQS 60010007   3                            3.8 ug/m3 LC
  Daily.AQI.Value Local.Site.Name Daily.Obs.Count Percent.Complete
1              58       Livermore               1              100
2              60       Livermore               1              100
3              39       Livermore               1              100
4              21       Livermore               1              100
5              23       Livermore               1              100
6              21       Livermore               1              100
  AQS.Parameter.Code AQS.Parameter.Description Method.Code
1              88101  PM2.5 - Local Conditions         170
2              88101  PM2.5 - Local Conditions         170
3              88101  PM2.5 - Local Conditions         170
4              88101  PM2.5 - Local Conditions         170
5              88101  PM2.5 - Local Conditions         170
6              88101  PM2.5 - Local Conditions         170
                    Method.Description CBSA.Code
1 Met One BAM-1020 Mass Monitor w/VSCC     41860
2 Met One BAM-1020 Mass Monitor w/VSCC     41860
3 Met One BAM-1020 Mass Monitor w/VSCC     41860
4 Met One BAM-1020 Mass Monitor w/VSCC     41860
5 Met One BAM-1020 Mass Monitor w/VSCC     41860
6 Met One BAM-1020 Mass Monitor w/VSCC     41860
                          CBSA.Name State.FIPS.Code      State County.FIPS.Code
1 San Francisco-Oakland-Hayward, CA               6 California                1
2 San Francisco-Oakland-Hayward, CA               6 California                1
3 San Francisco-Oakland-Hayward, CA               6 California                1
4 San Francisco-Oakland-Hayward, CA               6 California                1
5 San Francisco-Oakland-Hayward, CA               6 California                1
6 San Francisco-Oakland-Hayward, CA               6 California                1
   County Site.Latitude Site.Longitude
1 Alameda      37.68753      -121.7842
2 Alameda      37.68753      -121.7842
3 Alameda      37.68753      -121.7842
4 Alameda      37.68753      -121.7842
5 Alameda      37.68753      -121.7842
6 Alameda      37.68753      -121.7842
tail(data_2022)
            Date Source  Site.ID POC Daily.Mean.PM2.5.Concentration    Units
59913 12/01/2022    AQS 61131003   1                            3.4 ug/m3 LC
59914 12/07/2022    AQS 61131003   1                            3.8 ug/m3 LC
59915 12/13/2022    AQS 61131003   1                            6.0 ug/m3 LC
59916 12/19/2022    AQS 61131003   1                           34.8 ug/m3 LC
59917 12/25/2022    AQS 61131003   1                           23.2 ug/m3 LC
59918 12/31/2022    AQS 61131003   1                            1.0 ug/m3 LC
      Daily.AQI.Value      Local.Site.Name Daily.Obs.Count Percent.Complete
59913              19 Woodland-Gibson Road               1              100
59914              21 Woodland-Gibson Road               1              100
59915              33 Woodland-Gibson Road               1              100
59916              99 Woodland-Gibson Road               1              100
59917              77 Woodland-Gibson Road               1              100
59918               6 Woodland-Gibson Road               1              100
      AQS.Parameter.Code AQS.Parameter.Description Method.Code
59913              88101  PM2.5 - Local Conditions         145
59914              88101  PM2.5 - Local Conditions         145
59915              88101  PM2.5 - Local Conditions         145
59916              88101  PM2.5 - Local Conditions         145
59917              88101  PM2.5 - Local Conditions         145
59918              88101  PM2.5 - Local Conditions         145
                                         Method.Description CBSA.Code
59913 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59914 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59915 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59916 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59917 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
59918 R & P Model 2025 PM-2.5 Sequential Air Sampler w/VSCC     40900
                                    CBSA.Name State.FIPS.Code      State
59913 Sacramento--Roseville--Arden-Arcade, CA               6 California
59914 Sacramento--Roseville--Arden-Arcade, CA               6 California
59915 Sacramento--Roseville--Arden-Arcade, CA               6 California
59916 Sacramento--Roseville--Arden-Arcade, CA               6 California
59917 Sacramento--Roseville--Arden-Arcade, CA               6 California
59918 Sacramento--Roseville--Arden-Arcade, CA               6 California
      County.FIPS.Code County Site.Latitude Site.Longitude
59913              113   Yolo      38.66121      -121.7327
59914              113   Yolo      38.66121      -121.7327
59915              113   Yolo      38.66121      -121.7327
59916              113   Yolo      38.66121      -121.7327
59917              113   Yolo      38.66121      -121.7327
59918              113   Yolo      38.66121      -121.7327
# Variable names and types
names(data_2002)
 [1] "Date"                           "Source"                        
 [3] "Site.ID"                        "POC"                           
 [5] "Daily.Mean.PM2.5.Concentration" "Units"                         
 [7] "Daily.AQI.Value"                "Local.Site.Name"               
 [9] "Daily.Obs.Count"                "Percent.Complete"              
[11] "AQS.Parameter.Code"             "AQS.Parameter.Description"     
[13] "Method.Code"                    "Method.Description"            
[15] "CBSA.Code"                      "CBSA.Name"                     
[17] "State.FIPS.Code"                "State"                         
[19] "County.FIPS.Code"               "County"                        
[21] "Site.Latitude"                  "Site.Longitude"                
str(data_2002)
'data.frame':   15976 obs. of  22 variables:
 $ Date                          : chr  "01/05/2002" "01/06/2002" "01/08/2002" "01/11/2002" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Daily.Mean.PM2.5.Concentration: num  25.1 31.6 21.4 25.9 34.5 41 29.3 15 18.8 37.9 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  81 93 74 82 98 115 89 62 69 107 ...
 $ Local.Site.Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method.Code                   : int  120 120 120 120 120 120 120 120 120 120 ...
 $ Method.Description            : chr  "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" "Andersen RAAS2.5-300 PM2.5 SEQ w/WINS" ...
 $ CBSA.Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA.Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site.Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site.Longitude                : num  -122 -122 -122 -122 -122 ...
names(data_2022)
 [1] "Date"                           "Source"                        
 [3] "Site.ID"                        "POC"                           
 [5] "Daily.Mean.PM2.5.Concentration" "Units"                         
 [7] "Daily.AQI.Value"                "Local.Site.Name"               
 [9] "Daily.Obs.Count"                "Percent.Complete"              
[11] "AQS.Parameter.Code"             "AQS.Parameter.Description"     
[13] "Method.Code"                    "Method.Description"            
[15] "CBSA.Code"                      "CBSA.Name"                     
[17] "State.FIPS.Code"                "State"                         
[19] "County.FIPS.Code"               "County"                        
[21] "Site.Latitude"                  "Site.Longitude"                
str(data_2022)
'data.frame':   59918 obs. of  22 variables:
 $ Date                          : chr  "01/01/2022" "01/02/2022" "01/03/2022" "01/04/2022" ...
 $ Source                        : chr  "AQS" "AQS" "AQS" "AQS" ...
 $ Site.ID                       : int  60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 60010007 ...
 $ POC                           : int  3 3 3 3 3 3 3 3 3 3 ...
 $ Daily.Mean.PM2.5.Concentration: num  12.7 13.9 7.1 3.7 4.2 3.8 2.3 6.9 13.6 11.2 ...
 $ Units                         : chr  "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" "ug/m3 LC" ...
 $ Daily.AQI.Value               : int  58 60 39 21 23 21 13 38 59 55 ...
 $ Local.Site.Name               : chr  "Livermore" "Livermore" "Livermore" "Livermore" ...
 $ Daily.Obs.Count               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Percent.Complete              : num  100 100 100 100 100 100 100 100 100 100 ...
 $ AQS.Parameter.Code            : int  88101 88101 88101 88101 88101 88101 88101 88101 88101 88101 ...
 $ AQS.Parameter.Description     : chr  "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" "PM2.5 - Local Conditions" ...
 $ Method.Code                   : int  170 170 170 170 170 170 170 170 170 170 ...
 $ Method.Description            : chr  "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" "Met One BAM-1020 Mass Monitor w/VSCC" ...
 $ CBSA.Code                     : int  41860 41860 41860 41860 41860 41860 41860 41860 41860 41860 ...
 $ CBSA.Name                     : chr  "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" "San Francisco-Oakland-Hayward, CA" ...
 $ State.FIPS.Code               : int  6 6 6 6 6 6 6 6 6 6 ...
 $ State                         : chr  "California" "California" "California" "California" ...
 $ County.FIPS.Code              : int  1 1 1 1 1 1 1 1 1 1 ...
 $ County                        : chr  "Alameda" "Alameda" "Alameda" "Alameda" ...
 $ Site.Latitude                 : num  37.7 37.7 37.7 37.7 37.7 ...
 $ Site.Longitude                : num  -122 -122 -122 -122 -122 ...
# Distribution of PM2.5
summary(data_2002$Daily.Mean.PM2.5.Concentration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    7.00   12.00   16.12   20.50  104.30 
hist(data_2002$Daily.Mean.PM2.5.Concentration,
     main = "PM2.5 Distribution - 2002",
     xlab = "Daily Mean PM2.5",
     col = "lightblue")

summary(data_2022$Daily.Mean.PM2.5.Concentration)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -6.700   4.100   6.800   8.414  10.700 302.500 
hist(data_2022$Daily.Mean.PM2.5.Concentration,
     main = "PM2.5 Distribution - 2022",
     xlab = "Daily Mean PM2.5",
     col = "lightgreen")

Basic Map of Monitoring Sites

Based on the maps, the overall spatial distribution of monitoring sites stayed broadly the same between 2002 and 2022; however, there appear to be more PM2.5 monitoring sites in the Central Valley and Bay Area in 2022.

# Adding year column
data_2002$Year <- 2002
data_2022$Year <- 2022

# Rename PM2.5 column
names(data_2002)[names(data_2002) == "Daily.Mean.PM2.5.Concentration"] <- "PM25"
names(data_2022)[names(data_2022) == "Daily.Mean.PM2.5.Concentration"] <- "PM25"

# Combine datasets
combined_data <- rbind(data_2002, data_2022)

# Rename latitude and longitude
combined_data <- combined_data |>
  rename(Lat = Site.Latitude,
         Lon = Site.Longitude)

sites_2002 <- combined_data %>% filter(Year == 2002)
sites_2022 <- combined_data %>% filter(Year == 2022)

# Interactive map: 2002
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = sites_2002,
                   ~Lon, ~Lat,
                   color = "darkblue",
                   radius = 3,
                   label = ~paste("2002 Site:", Local.Site.Name),
                   group = "2002")
# Interactive map: 2022
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = sites_2022,
                   ~Lon, ~Lat,
                   color = "tomato",
                   radius = 3,
                   label = ~paste("2022 Site:", Local.Site.Name),
                   group = "2022")
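
Because the combined data has one row per daily observation, the maps above draw one marker per observation rather than one per site. A possible refinement, sketched here on a small made-up data frame (the `toy` values are illustrative, not EPA data), is to reduce each year to its unique sites first. This also yields a site count per year, which is one way to check the visual impression from the maps.

```r
# Toy stand-in for combined_data (made-up rows, NOT the real monitoring data).
toy <- data.frame(
  Year    = c(2002, 2002, 2002, 2022, 2022, 2022, 2022),
  Site.ID = c(1, 1, 2, 1, 1, 2, 3),
  Lat     = c(37.7, 37.7, 38.7, 37.7, 37.7, 38.7, 34.1),
  Lon     = c(-121.8, -121.8, -121.7, -121.8, -121.8, -121.7, -118.2)
)

# Keep one row per (Year, Site.ID) so each site is drawn only once.
unique_sites <- toy[!duplicated(toy[, c("Year", "Site.ID")]),
                    c("Year", "Site.ID", "Lat", "Lon")]

# Number of distinct monitoring sites in each year.
site_counts <- table(unique_sites$Year)
site_counts
```

With the real data, the rows of `unique_sites` for a single year could be passed as `data =` to `addCircleMarkers()` in place of `sites_2002` or `sites_2022`.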

Missing & Implausible Values

Although there are no missing PM2.5 values, some values are implausible. Specifically, PM2.5 concentrations cannot be negative, yet the combined dataset contains 215 negative values. Broken down by year, none were reported in 2002; all 215 negative observations come from 2022.

  • 2002
    • Missing values: 0
    • Implausible (negative) values: 0
  • 2022
    • Missing values: 0
    • Implausible (negative) values: 215
      • Looking at the histogram of negative values by date, winter appears to have the most negative values. Summer also shows an uptick, while few were reported in spring and autumn.
# Missing values
sum(is.na(combined_data$PM25))
[1] 0
# Negative values (not possible for PM2.5)
sum(combined_data$PM25 < 0)
[1] 215
# Very high values (above 1000 µg/m³)
sum(combined_data$PM25 > 1000)
[1] 0
# Missing values by year
tapply(is.na(combined_data$PM25), combined_data$Year, sum)
2002 2022 
   0    0 
# Negative values by year
tapply(combined_data$PM25 < 0, combined_data$Year, sum)
2002 2022 
   0  215 
# Very high values by year
tapply(combined_data$PM25 > 1000, combined_data$Year, sum)
2002 2022 
   0    0 
# Total observations by year
table(combined_data$Year)

 2002  2022 
15976 59918 
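
Raw counts are hard to compare across years because 2022 has far more observations than 2002. A small sketch of turning the counts into per-year percentages, using only the totals reported in the output above:

```r
# Counts taken from the output above.
n_obs <- c("2002" = 15976, "2022" = 59918)
n_neg <- c("2002" = 0,     "2022" = 215)

# Share of observations in each year that are negative, as a percentage.
pct_neg <- 100 * n_neg / n_obs
round(pct_neg, 2)
```

On the data frame itself, the same proportions could be obtained with `tapply(combined_data$PM25 < 0, combined_data$Year, mean)`, since the mean of a logical vector is the proportion of `TRUE` values.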
# Temporal changes
## Filter negative values
neg_2002 <- combined_data[combined_data$Year == 2002 & combined_data$PM25 < 0, ]
neg_2022 <- combined_data[combined_data$Year == 2022 & combined_data$PM25 < 0, ]

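## Ensure Date is in Date format
## (dates are stored as MM/DD/YYYY text, so the format must be given explicitly;
## without it, as.Date() misreads these strings and returns NA for many rows)
neg_2002$Date <- as.Date(neg_2002$Date, format = "%m/%d/%Y")
neg_2022$Date <- as.Date(neg_2022$Date, format = "%m/%d/%Y")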

# Histogram for 2002 (no negative values, so chart is blank)
ggplot(neg_2002, aes(x = Date)) +
  geom_histogram(binwidth = 7, fill = "darkred", color = "white") +
  labs(title = "Negative PM2.5 Values in 2002",
       x = "Date", y = "Count") +
  theme_minimal()

The histogram for 2002 is blank because there are no negative values in that year. All negative values were reported in 2022.
# Histogram for 2022
ggplot(neg_2022, aes(x = Date)) +
  geom_histogram(binwidth = 30, fill = "darkred", color = "white") +
  labs(title = "Negative PM2.5 Values in 2022",
       x = "Date", y = "Count") +
  theme_minimal()
Warning: Removed 101 rows containing non-finite outside the scale range
(`stat_bin()`).
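
The seasonal pattern of negative values can also be tabulated directly rather than read off a histogram. The sketch below parses MM/DD/YYYY strings with an explicit format and assigns each date to a meteorological season. The example dates are made up, and `season_of` is a hypothetical helper, not part of any loaded package.

```r
# Hypothetical helper: map a Date to its meteorological season.
season_of <- function(d) {
  m <- as.integer(format(d, "%m"))
  ifelse(m %in% c(12, 1, 2), "Winter",
  ifelse(m %in% 3:5,         "Spring",
  ifelse(m %in% 6:8,         "Summer", "Autumn")))
}

# Made-up MM/DD/YYYY strings standing in for neg_2022$Date (NOT real data).
raw_dates <- c("01/15/2022", "02/03/2022", "07/22/2022",
               "10/30/2022", "12/25/2022")

# The explicit format is what lets days above 12 parse correctly.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

season_counts <- table(season_of(parsed))
season_counts
```

Applied to the negative 2022 observations, `table(season_of(...))` would give a count per season that does not depend on the choice of histogram bin width.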

Exploring Spatial Data at Three Levels

Primary Question: Have daily concentrations of PM2.5 decreased in California over the 20 years spanning from 2002 to 2022?